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Abstract 

We study a class of multiplicative algorithms introduced by Silvey et al. (1978) for com- 
puting D-optimal designs. Strict monotonicity is established for a variant considered by 
Titterington (1978). A formula for the rate of convergence is also derived. This is used to 
explain why modifications considered by Titterington (1978) and Dette et al. (2008) usually 
converge faster. 
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1 Introduction 

Let X = {xi, . . . , x n } C R m be a design space of n points (n > m). We consider computational 
aspects of D-optimal design (approximate theory) for linear models (Kiefer 1974; Silvey, 1980; 
Pazman, 1986; Pukelsheim, 1993). The D-criterion seeks to maximize the determinant of the 
m x m matrix 

n 

M(w) = WjXjxJ 
i=l 

with respect to w = {w\, . . . ,w n ) T G where Q denotes the closure of = {w : Y^i=i w i = 
1, Wi > 0}. As usual, M(w) represents the Fisher information for the m x 1 parameter 6 in the 
linear model 

y\(x,e)~N(x T 6,a 2 ) 
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when the number of units assigned to Xi is proportional to iWj. An iterative procedure to solve 
this problem (Silvey et al. 1978) is as follows. 
Algorithm I 

1. Set™(°) = (>f,...,4V GO. 

2. For t = 1,2, . . ., compute 

w)/ =w\ '— — , i = l,...,n. (1) 

m 

Iterate until convergence. 

Algorithm II (Titterington 1978), a variant of Algorithm I, can be applied when the design 
points include an intercept, i.e., Xi = (l,zj) , where Zi G R m_1 . 
Algorithm II 

1. The same as Step 1 of Algorithm I. 

2. For t = 1,2..., compute 

n n 



i=l i=l 



(t) _ (t-i) - *) T M- V t-1) )(* " ^) 



w\' =w\ , t = l,...,n. 2 

m — 1 

Iterate until convergence. 

In a form that resembles (pQ) even more closely, ([2]) reads 

(t) (t-i)xj ' M- l (w^)xi- a . 

w\> =w\ '— , i = l,...,n, (3) 

m — a 

with a = 1. Note that ([3]) does not require that the design points include an intercept, and 
therefore can be more broadly applicable than (|2|). In what follows Xi need not include an 
intercept when we refer to ([3]). 

Algorithms I and II have generated considerable interest; see, for example, Titterington 
(1976, 1978), Silvey et al. (1978), Pazman (1986), Torsney and Mandal (2006), Harman and 
Pronzato (2007), Dette et al. (2008), and Yu (2010). Algorithm I is known to be monotonic 
(Titterington 1976), i.e., det M(w^) never decreases in t. Monotonicity of Algorithm II has 
been resolved recently (Titterington 1978; Yu, 2010). Part of this work aims to extend this to 



strict monotonicity, thereby showing that Algorithm II converges monotonically for m > 3, and 
fully resolving Titterington's (1978) conjecture. 

It has been observed that Algorithm II usually converges faster than Algorithm I; see, e.g., 
Dette et al. (2008). Another goal of this work is to give an explanation of this by a theoretical 
analysis of the convergence rates. Our investigation is partly inspired by Dette et al. (2008), 
who propose an iteration of the form of (j3J) with a dynamic choice of a = a®. These authors 
provide an upper bound on which ensures the monotonicity of (J3]), and also observe that 
their algorithm converges faster than Algorithm I in numerical examples. We shall also discuss 
the convergence rate of this dynamic algorithm. 

Section 2 establishes the strict monotonicity of Algorithm II. The argument extends that of 
Yu (2010). In Section 3, we analyze iteration (|3j) for fixed a in terms of both the matrix rate 
and the global rate. For Algorithm I (i.e., a = 0), it is shown that the matrix rate has only 
nonnegative eigenvalues. Combined with a simple relation between the convergence rates of 
for different a, this shows that, with some exceptions, iteration (jSJ) with a > will converge 
faster than Algorithm I. Section 4 concludes with a small numerical illustration. 

2 Strict Monotonicity of Algorithm II 

Theorem 1 of Yu (2010) implies the monotonicity, but not strict monotonicity, of iteration 
(J2J). In Proposition [1] below, we establish strict monotonicity for m > 3. We include the 
proof of monotonicity for completeness, but our emphasis is on the equality condition. Strict 
monotonicity plays a key role in the proof of the convergence theorem (Theorem [1]) . 

Let us denote Q + = {w 6 Cl : M(w) > (positive definite)}. In this section we assume 
X{ = (1, zJ) T , and write X = (x\, . . . , x n ) T . 

Proposition 1. Assume m > 3 and X has full rank m. Then iteration p|) is strictly monotonic. 
That is, if w^~ x \w^ € Vt + satisfy (dp, then det M(w^ t ~ 1 " > ) < det M(w®), with equality only if 

Proof. Let K = (0 m _i, I m -l) where r denotes the r x 1 vector of zeros, and I r denotes the 
r x r identity matrix. Define ip(M) = log det (KMK T ) for any positive definite m x m matrix 
M. Consider the function 

fc(E, w, Q) = V(X) + tr(^'(S)(QA- 1 Q T - £)) 
3 



where £ (m x m) is positive definite, «>£(!, A„, = Diag(u;), and Q (to x n) is full-rank. Because 
■ifj(M) is concave in M, and strictly concave when restricted to KMK T , we have 

h(^,w,Q)>^(QA- 1 Q T ) (4) 

with equality only when K(QA~ 1 Q T - S)K T = 0. 

On the other hand, suppose QX = I m , then we have 

^(QA- 1 Q T )>^(M" 1 H). (5) 

This holds because QY is an unbiased estimator of 8 in the linear model 

Y ~ N(X9, a.- 1 ). 

Hence its variance matrix QA~ 1 Q T is at least as large (in the positive definite ordering) as 
(X T AwX)^ 1 = M^ 1 (w), which corresponds to the weighted least squares estimator Qwls = 
(X T A W X)~ 1 X T A w . Moreover, equality in ([5]) holds only when K(Q — Qwls) = 5 i>e., when 
QY agrees with Qwls^ i n all coordinates except the first. 

Let wC-^mjW eflbe related by ©. Consider Q^ 1 ) = (X T A w(t _ 1) X)- 1 X T A w(t _ 1) and 
define Q^' similarly. We have 

tp(M~ 1 (w^ t ~ r) )) = h(M~ 1 (w^ t ~ 1 " > ),w^ t ~ 1 \Q^ t ~ r> ) 

= h{M- 1 {w( t -V),w®,Q( t - lS )) (6) 

>^(Q^A-J t) Q^ T ) (7) 

> ip(M~ 1 (w^)). (8) 

The key is the equality in ©, which follows from ([2]) after some algebra. The inequality ([7]) 
follows from The inequality © follows from ([5]). We also have the easily verified identity 
ip{M~ 1 {w)) = — log det M(w). Thus the monotonicity statement holds. 

To prove strict monotonicity, let us check the equality conditions in ([7]) and ([8]) . The equality 
in ([7]) entails 

#(M-V (t-1) ) - Q (t ~ 1)A J t) Q {t ~ 1)T ) KT = 0- (9) 

The equality in (JSj) entails 

if(Q(*-i) _ q(*)) = o, (10) 
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which implies 

KQ^Aj t) Q^ T K T = KQ^Aj t) Q^ T K T = KM~ 1 (w^)K T . 

We obtain 

K(M~ 1 (w^) - Af _1 (r«C*-i) ))ir" r = (11) 
in view of Q. After some calculation, we can show that (jlOp and ([lip imply 

X T (A w{t ^ - w^w^ T ) = X T (A wit) - w^wW). 

Equivalently, 

wf- 1) (x l -x^) = wf ) (x l -x^), i = l,...,n, (12) 

where = Yli=i w i x ii anc ^ x^~ 1 ^ is defined similarly. If wf = wf for any i, then 
x^ t_1 ) = = S, and (uw^* ^ — Wj)(xj — x) = for all j. That is, either Xj — S = 0, or 
Wj" = Wj. If Xj — x = 0, then Wj = by the form of ([2]), which contradicts the assumption 
that € fi. Hence, if wf ^ = for any i, then it holds for all i. Let us assume wf ^ ^ wf^ 
for all i. Rewriting f)12|) we get 

(A w(t _ 1) -A w(t) )X = (^*- 1 ), - U ;W)(x^ 1 ),xW) T . (13) 

We obtain a contradiction because the left-hand side of (113p has rank m > 3, whereas the 
right-hand side has rank at most two. It follows that iv® = u/* -1 ). 

Although the above argument assumes io(*~ 1 ),u;W G O, i.e., they have all positive coordi- 
nates, the conclusion still holds if we only assume w^ t ~ 1 \w^ G First, we can restrict our 
analysis to the positive coordinates of w^~ l \ If w^ 1 ^ G £7, but u;W has some zero coordinates, 
then we can show det M(u/* -1 )) < det M(w^) by a limiting argument, upon close inspection 
of ©. □ 

Remark. When m = 2, Algorithm II is still monotonic, but may not be strictly monotonic; 
see Pronzato et al. (2000), Chapter 7, and Section 3 below. 

Strict monotonicity leads to the following convergence theorem, which fully resolves Titter- 
ington's (1978) conjecture. 

Theorem 1. Assume m > 3 and X has full rank. Let iv® be a sequence generated by Algo- 
rithm II, starting with G SI- Then all limit points of are global maxima of det M(w) 
on w G f2 + and, as t increases to oo, det M(w^) increases to sup,„ g Q + detM(-w). 
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Yu (2010, Theorem 2) presents a convergence theorem for a general class of multiplicative 
algorithms. However, we cannot directly appeal to Theorem 2 in Yu (2010) because certain 
technical conditions are not satisfied. For example, need not have all positive coordinates 
even if wf ^ > for all i. However, inspection of ([2]) shows that wf is set to zero only when 
Z{ = z, in which case it can be shown that an optimal design need not include Xi as a support 
point, i.e., Xi is safely eliminated. Theorem [T] can then be proved by following the proof of 
Theorem 2 in Yu (2010) step by step (details omitted). 



3 Rate of Convergence 

In this section we analyze the convergence rate of iteration ([3]). Assume the matrix X = 
(xi, . . . , x n ) T has full rank m > 2. Let w* £ be a global maximizer of det M(w). We 
assume that w* has all positive components. (A slightly weaker assumption is that the starting 
value has the same zero pattern as w* .) Though unrealistic in a practical situation, such 
an assumption makes our analysis tractable. It seems to be a challenging problem to analyze 
the convergence rate when the algorithm tends to a boundary limit. In Section 4, we present 
numerical examples to corroborate our rather idealized analysis. 

The notions of the matrix rate and the global rate are often used in analyzing fixed point 
algorithms in statistical contexts (Dempster et al. 1977; Meng, 1994). Assume < a < m, and 
denote the mapping ([3]) by T. The matrix rate of convergence of T is defined as 

dT{w) 



R(a) 



dw 

because we have 

T{w) - w* ps R(a)(w - w*) (14) 

for w near w*. The global rate of convergence, r(a), is defined as the spectral radius (the 
maximum modulus of the eigenvalues) of R(a) when restricted as a linear mapping on the space 

r = { 7 GR m : l m7 = 0} 

where l m denotes the 1 x m vector of ones. Restricting R{a) to V is possible because of the 
implication 7 G V R(a) / y G T. This restriction is imposed because we have the constraints 
J2i w i = J2i w i = 1' an d hence w — w* G T in (fl~4"|) . Also note that such notions of convergence 
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rates merely reflect how the iterations of T behave near w*; whether the algorithm converges 
from an arbitrary starting value is a different issue. 

Let us define dij = xj M~ l (w*)xj, 1 < i,j < n. The matrix rate R(a) admits a simple 
formula. 

Proposition 2. The (i,j)th entry of R(a) is 

Rij{a) = < (15) 
[ {da - Widfi - a)/(m - a), i = j. 

Proof. By differentiating I m = M~ l (w)M{w) with respect to Wj and rearranging, we obtain 

dM- l (w) .dMiw)^ , , , Tl ., , 

; - M~ (w) v M~ (w) = -M (w)xjXj M- l (w), 



dwj dwj 
which yields, for i ^ j, 



dTi(w) _ Wi T 8M- 1 (w) Wid% 

~X; n Xi 



dwj m — a 1 dwj m — a 

The case of i = j is similar. □ 

Theorem [2] investigates the properties of R(a). Its proof uses a lemma concerning the 
Hadamard product (see, e.g., Pukelsheim (1993), p. 199). 

Lemma 1. If A = (Aij) and B = {Bij) are symmetric nonnegative definite matrices of the 
same dimension, then their entry-wise product C = ((AijBij)) is also nonnegative definite. 

Theorem 2. The matrix R(a) is diagonalizable, and all of its eigenvalues lie in the interval 
[—a/(m — a), 1]. 

Proof. We have da = m for all i by the general equivalence theorem (Kiefer and Wolfowitz, 
1960). (Note the assumption that w* has all positive components.) Thus 

A * D* 

R{a)=I m -^L±L, (16) 
m — a 

where A w * = Diag(w*) as before, and D* = {df-) nxn . Define D = (dij) nxn . The formula 
D = XM~ l {w*)X T shows that D is nonnegative definite, and so is D* by Lemma [TJ since D* is 

1/2 1/2 

the entry-wise product of D with itself. By (fTB]) . I m — R(a) is similar to Aj» D*AJ I * /(m — a), 
which is nonnegative definite because D* is. Hence I m — R(a) is diagonalizable, with nonnegative 
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eigenvalues. That is, R(a) is diagonalizable, and its eigenvalues do not exceed one. On the other 
hand, A W *D* satisfies (i) each entry is nonnegative, and (ii) each column sums to ^2iW*dfj = 
djj = m. By the Probenius-Perron theorem (see Horn and Johnson (1990), Chapter 8), any 
eigenvalue of A W *D* cannot exceed m. It follows from (|16p that any eigenvalue of R{a) is at 
least 1 — m/{m — a) = —a/{m — a). □ 

For Algorithm I, Theorem [2] leads to an eigenvalue bound similar to that for the EM algo- 
rithm; see Yu (2009) for another similar situation in the context of Shannon theory. 

Corollary 1. All eigenvalues of R(0) lie in the interval [0,1]. 

Because Algorithm I converges, it is not surprising that eigenvalues of R(0) do not exceed 
one. However, that these eigenvalues are nonnegative shows that the iterations of Algorithm I 
are conservative, and may be improved by some form of overrelaxation, e.g., by using a > in 
([3]). To make this intuition precise, we compare convergence rates of iteration ([3]) for different 
a (with respect to the same w*). Equation f)16|) yields 

m 

I m -R(a) = (I m -R(0)). (17) 

m — a 

If we define I m — R(a) as the matrix speed of convergence, then (|17p has an appealing inter- 
pretation: iteration ([3]) is precisely m/(m — a) times as fast as Algorithm I. Nevertheless, one 
should be cautious toward such an interpretation. First, we need to assume that (|3j) converges, 
which is not always guaranteed. Secondly, when some of the eigenvalues of R(a) are negative, 
iteration ([3]) can actually be slower than Algorithm I, as the following example illustrates. Let 
n = m = 2 and consider the design space X = {x\ = (1, — 1) T , X2 = (1,1) T }. Iteration ([3]) 
maps any u/* -1 ) = (w\,W2) T G £1 to 

= (1 — awi, 1 — aw2) J ■ 

2 — a 

We have R(a) = —0/2/(2 — a). The algorithm reaches w* = (1/2,1/2) T in one iteration if 
a = 0, but becomes slower and slower as a increases from to 1. When a = 1 it alternates 
between two points (wi,u>2) T and (u> 2 ,wi) T (assuming w\ 7^ u^) and does not even converge. 
(Non-convergence of Algorithm II when m = 2 has been noted by Pronzato et al. (2000).) 

What (|17p does imply is that, if a is not too large, and if Algorithm I itself is slow, then 
iteration ([3]) will converge faster than Algorithm I. Intuitively, an a too large would overshoot 

8 



and slow the algorithm down. Proposition [3] makes this explicit by comparing the global rate 
r(a). 

Proposition 3. Assume r(0) > 2a/m. Then 

m 

l-r(a) = (l-r(O)), (18) 

m — a 

and hence r(a) < r(0). 

Proof. Let r + {a) (resp. r_(a)) denote the largest (resp. smallest) eigenvalues of R(a) when 
restricted as a linear mapping on T. Then r(a) = max{|r + (a)|, |r_(a)|}. Corollary Q] implies 
r(0) = r + (0). By (HZ]), 

/x m , . . . m — 2a 

1 _ r+ (a) = (1 - r(0) < . 

m — a m — a 

That is, r + (a) > a/{m — a). On the other hand, Theorem [2] implies r„(a) > —a/{m — a). 
Hence r + (a) > [r_(a)|, and r{a) = r + (a), thus proving (fT8|) . □ 

Corollary 2. // r(0) > 2/m, t/ien the global rate of Algorithm II is no worse than that of 
Algorithm I. 

Corollary [2] suggests that Algorithm II is likely to converge faster than Algorithm I, as long as 
m > 3 and r(0) is reasonably large. Note that in a practical situation, Algorithm I can be quite 
slow, i.e., r(0) is close to one. This explains the observed improvement of using Algorithm II in 
numerical examples. 

Dette et al. (2008) consider a version of (J3j) where a = cnW is set at each iteration. It is 
shown that by choosing 

a {t) = - min xj M _1 (u> ( * _1) ) Xi (19) 
2 i 

the resulting algorithm is monotonic, and usually converges faster than Algorithm I. Although 
this algorithm is dynamic, we can still discuss its asymptotic rate of convergence, because if 
t — > oo and w^> — > w*, then also tends to a limit: 

a = lim = — minxj M~ l (w*)xi. 

t^foo 2 i 

It follows that, for large t, each iteration of this dynamic algorithm behaves as if a is fixed at 
a. If w* has all positive components, then a = m/2 by the general equivalence theorem; in 
general < a < m/2. If a = m/2, and if (|18p holds, then we can loosely say that the dynamic 



algorithm is twice (m/(m — a) = 2) as fast as Algorithm I. In a practical situation, however, it 
is more likely that a < m/2, hence we may expect a less pronounced improvement; see Section 4 
for a numerical example. 



4 Numerical Example 

The formula f)18|) is derived under the assumption that all coordinates of w* are positive. As 
mentioned earlier, in realistic problems this is usually not true. It is therefore reasonable to ask 
whether (I18p holds in any practical sense. To study this, we employ an empirical measure of the 
convergence rate, defined as 

r = lim L-if (20) 

where \v\ = (Yli^) 1 ^ 2 ■ We compare the f for iteration ([3]) with different values of a for a few 
regression models. Define Si = i/20, i = 1, . . . , 20. Similar to Dette et al. (2008), we consider 
design spaces 

X 1 = { Xi = (1, e~ s ', Si e- S ') T , i = l,..., 20}; 

<*2 = {xi = (1, Si/(K + Si), Si/(K + Si) 2 ) T , i = 1, . . . , 20}, k = 0.5; 
X 3 = { Xl = (1, s h si sf) T , i = l,..., 20}. 

Note that a D-optimal design on Xi is equivalently a locally D-optimal design for the parameter 
(/?0j/?i,k) in the nonlinear model, 

y = Po + ^ + e, e~N(0, a 2 ), (21) 

K + S 

where the design space for s is {sj = i/20, i = 1, . . . , 20}, and the prior guess for k is k* = 0.5. 
Mathematically, (|2ip with /3o = corresponds to the Michaelis-Menten model often employed 
to describe enzyme kinetics. 

Table 1 records the estimates of 1 — f for iteration ([3]) with various choices of a (fixed or 
dynamic). Each algorithm is started from the uniform design (uij = 1/20, i = 1, . . . , 20), and f is 
estimated by the ratio on the right hand side of (|20|) when it stabilizes. The first three columns of 
Table 1 deal with fixed a, in which case we write f = r(a). If we interpret 1 — r as the empirical 
speed of convergence, then evidently larger values of a improve the speed. For X\ and X2, the 
ratio of improvement, (1 — f(a))/(l — r(0)), is approximately equal to m/(m — a), a = 0.5, 1. 
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Table 1: Values of 1 — f (the empirical speed) for iteration ([3]) with several design spaces and 
choices of a. Dynamic a refers to the algorithm of Dette et al. (2008). 





a = 


a = 0.5 


a = 1 


dynamic a 




0.0168 


0.0202 


0.0252 


0.0245 


x 2 


0.0177 


0.0212 


0.0264 


0.0256 


x 3 


0.0062 


0.0068 


0.0076 


0.0082 



For X%, this ratio is below the value suggested by (|18p for either a = 0.5 or a = 1. However, it is 
possible that accurate estimation of the ratio of improvement becomes more difficult because the 
algorithms are much slower for X% than for X\ or X 2 . The last column concerns the algorithm 
of Dette et al. (2008) where o> is set dynamically as in (I19j) . The limiting value lim^oo a® is 
estimated at a\ = 0.939 for X\, a 2 = 0.935 for X 2 , and Q3 = 1.303 for X3. We observe that these 
agree well with what (|18|) suggests. For example, the ratios of improvement for the dynamic 
algorithm are 

0.0245 _ 3 ^ 0.0256 ^ 3 
0.0168 ~ 3-di an 0.0177 ~ 3 - a 2 

for X\ and X 2 respectively. Overall, we believe that f)18|) remains suggestive of how much 

iteration (J3J) can improve upon Algorithm I in realistic situations. 
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